Download and read “Census Income” data into R and prepare graphical and numerical summaries of it: e.g. histograms of continuous attributes, contingency tables of categorical variables, scatterplots of continuous attributes with some of the categorical variables indicated by color/symbol shape, etc. Perform principal components analysis of this data (do you need to scale it prior to that? how would you represent multilevel categorical attributes to be used as inputs for PCA?) and plot observations in the space of the first few principal components with subjects’ gender and/or categorized income indicated by color/shape of the symbol. Perform univariate assessment of associations between outcome we will be modeling and each of the attributes (e.g. t-test or logistic regression for continuous attributes, contingency tables/Fisher exact test/\(\chi^2\) test for categorical attributes). Summarize your observations from these assessments: does it appear that there is association between outcome and predictors? Which predictors seem to be more/less relevant?
adultDat <- read.csv("adult.data", header = FALSE)
#Skipping first row as it contains a strange string
adultTest <- read.csv("adult.test", skip=1, header=FALSE)
#Take a peek
head(adultDat)
## V1 V2 V3 V4 V5 V6
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## V7 V8 V9 V10 V11 V12 V13
## 1 Adm-clerical Not-in-family White Male 2174 0 40
## 2 Exec-managerial Husband White Male 0 0 13
## 3 Handlers-cleaners Not-in-family White Male 0 0 40
## 4 Handlers-cleaners Husband Black Male 0 0 40
## 5 Prof-specialty Wife Black Female 0 0 40
## 6 Exec-managerial Wife White Female 0 0 40
## V14 V15
## 1 United-States <=50K
## 2 United-States <=50K
## 3 United-States <=50K
## 4 United-States <=50K
## 5 Cuba <=50K
## 6 United-States <=50K
head(adultTest)
## V1 V2 V3 V4 V5 V6
## 1 25 Private 226802 11th 7 Never-married
## 2 38 Private 89814 HS-grad 9 Married-civ-spouse
## 3 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse
## 4 44 Private 160323 Some-college 10 Married-civ-spouse
## 5 18 ? 103497 Some-college 10 Never-married
## 6 34 Private 198693 10th 6 Never-married
## V7 V8 V9 V10 V11 V12 V13
## 1 Machine-op-inspct Own-child Black Male 0 0 40
## 2 Farming-fishing Husband White Male 0 0 50
## 3 Protective-serv Husband White Male 0 0 40
## 4 Machine-op-inspct Husband Black Male 7688 0 40
## 5 ? Own-child White Female 0 0 30
## 6 Other-service Not-in-family White Male 0 0 30
## V14 V15
## 1 United-States <=50K.
## 2 United-States <=50K.
## 3 United-States >50K.
## 4 United-States >50K.
## 5 United-States <=50K.
## 6 United-States <=50K.
#Assign names, taken from the corresponding txt files
adultNames <- unique(str_match(readLines("old.adult.names.txt"), "^[a-zA-Z-]+"))
adultNames <- adultNames[!is.na(adultNames)]
#Replace dashes with dots for easier reference
adultNames <-gsub('-','.',adultNames)
#Assign names
colnames(adultDat) <- adultNames
head(adultDat)
## age workclass fnlwgt education education.num
## 1 39 State-gov 77516 Bachelors 13
## 2 50 Self-emp-not-inc 83311 Bachelors 13
## 3 38 Private 215646 HS-grad 9
## 4 53 Private 234721 11th 7
## 5 28 Private 338409 Bachelors 13
## 6 37 Private 284582 Masters 14
## marital.status occupation relationship race sex
## 1 Never-married Adm-clerical Not-in-family White Male
## 2 Married-civ-spouse Exec-managerial Husband White Male
## 3 Divorced Handlers-cleaners Not-in-family White Male
## 4 Married-civ-spouse Handlers-cleaners Husband Black Male
## 5 Married-civ-spouse Prof-specialty Wife Black Female
## 6 Married-civ-spouse Exec-managerial Wife White Female
## capital.gain capital.loss hours.per.week native.country class
## 1 2174 0 40 United-States <=50K
## 2 0 0 13 United-States <=50K
## 3 0 0 40 United-States <=50K
## 4 0 0 40 United-States <=50K
## 5 0 0 40 Cuba <=50K
## 6 0 0 40 United-States <=50K
colnames(adultTest) <- adultNames
head(adultTest)
## age workclass fnlwgt education education.num marital.status
## 1 25 Private 226802 11th 7 Never-married
## 2 38 Private 89814 HS-grad 9 Married-civ-spouse
## 3 28 Local-gov 336951 Assoc-acdm 12 Married-civ-spouse
## 4 44 Private 160323 Some-college 10 Married-civ-spouse
## 5 18 ? 103497 Some-college 10 Never-married
## 6 34 Private 198693 10th 6 Never-married
## occupation relationship race sex capital.gain
## 1 Machine-op-inspct Own-child Black Male 0
## 2 Farming-fishing Husband White Male 0
## 3 Protective-serv Husband White Male 0
## 4 Machine-op-inspct Husband Black Male 7688
## 5 ? Own-child White Female 0
## 6 Other-service Not-in-family White Male 0
## capital.loss hours.per.week native.country class
## 1 0 40 United-States <=50K.
## 2 0 50 United-States <=50K.
## 3 0 40 United-States >50K.
## 4 0 40 United-States >50K.
## 5 0 30 United-States <=50K.
## 6 0 30 United-States <=50K.
#Check that numerics are read in correctly and characters read as factors
summary(adultDat)
## age workclass fnlwgt
## Min. :17.00 Private :22696 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827
## Median :37.00 Local-gov : 2093 Median : 178356
## Mean :38.58 ? : 1836 Mean : 189778
## 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051
## Max. :90.00 Self-emp-inc : 1116 Max. :1484705
## (Other) : 981
## education education.num marital.status
## HS-grad :10501 Min. : 1.00 Divorced : 4443
## Some-college: 7291 1st Qu.: 9.00 Married-AF-spouse : 23
## Bachelors : 5355 Median :10.00 Married-civ-spouse :14976
## Masters : 1723 Mean :10.08 Married-spouse-absent: 418
## Assoc-voc : 1382 3rd Qu.:12.00 Never-married :10683
## 11th : 1175 Max. :16.00 Separated : 1025
## (Other) : 5134 Widowed : 993
## occupation relationship
## Prof-specialty :4140 Husband :13193
## Craft-repair :4099 Not-in-family : 8305
## Exec-managerial:4066 Other-relative: 981
## Adm-clerical :3770 Own-child : 5068
## Sales :3650 Unmarried : 3446
## Other-service :3295 Wife : 1568
## (Other) :9541
## race sex capital.gain
## Amer-Indian-Eskimo: 311 Female:10771 Min. : 0
## Asian-Pac-Islander: 1039 Male :21790 1st Qu.: 0
## Black : 3124 Median : 0
## Other : 271 Mean : 1078
## White :27816 3rd Qu.: 0
## Max. :99999
##
## capital.loss hours.per.week native.country class
## Min. : 0.0 Min. : 1.00 United-States:29170 <=50K:24720
## 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 643 >50K : 7841
## Median : 0.0 Median :40.00 ? : 583
## Mean : 87.3 Mean :40.44 Philippines : 198
## 3rd Qu.: 0.0 3rd Qu.:45.00 Germany : 137
## Max. :4356.0 Max. :99.00 Canada : 121
## (Other) : 1709
summary(adultTest)
## age workclass fnlwgt
## Min. :17.00 Private :11210 Min. : 13492
## 1st Qu.:28.00 Self-emp-not-inc: 1321 1st Qu.: 116736
## Median :37.00 Local-gov : 1043 Median : 177831
## Mean :38.77 ? : 963 Mean : 189436
## 3rd Qu.:48.00 State-gov : 683 3rd Qu.: 238384
## Max. :90.00 Self-emp-inc : 579 Max. :1490400
## (Other) : 482
## education education.num marital.status
## HS-grad :5283 Min. : 1.00 Divorced :2190
## Some-college:3587 1st Qu.: 9.00 Married-AF-spouse : 14
## Bachelors :2670 Median :10.00 Married-civ-spouse :7403
## Masters : 934 Mean :10.07 Married-spouse-absent: 210
## Assoc-voc : 679 3rd Qu.:12.00 Never-married :5434
## 11th : 637 Max. :16.00 Separated : 505
## (Other) :2491 Widowed : 525
## occupation relationship
## Prof-specialty :2032 Husband :6523
## Exec-managerial:2020 Not-in-family :4278
## Craft-repair :2013 Other-relative: 525
## Sales :1854 Own-child :2513
## Adm-clerical :1841 Unmarried :1679
## Other-service :1628 Wife : 763
## (Other) :4893
## race sex capital.gain
## Amer-Indian-Eskimo: 159 Female: 5421 Min. : 0
## Asian-Pac-Islander: 480 Male :10860 1st Qu.: 0
## Black : 1561 Median : 0
## Other : 135 Mean : 1082
## White :13946 3rd Qu.: 0
## Max. :99999
##
## capital.loss hours.per.week native.country class
## Min. : 0.0 Min. : 1.00 United-States:14662 <=50K.:12435
## 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 308 >50K. : 3846
## Median : 0.0 Median :40.00 ? : 274
## Mean : 87.9 Mean :40.39 Philippines : 97
## 3rd Qu.: 0.0 3rd Qu.:45.00 Puerto-Rico : 70
## Max. :3770.0 Max. :99.00 Germany : 69
## (Other) : 801
#From here we can see that the numerics values are clean (no NAs).
#Will transpose and merge the unique values of each character variable
#to see what may need cleaning before we concatenate the datasets
sapply(adultDat,class)
## age workclass fnlwgt education education.num
## "integer" "factor" "integer" "factor" "integer"
## marital.status occupation relationship race sex
## "factor" "factor" "factor" "factor" "factor"
## capital.gain capital.loss hours.per.week native.country class
## "integer" "integer" "integer" "factor" "factor"
#Store bool of Character and Numeric variables
charVect <- sapply(adultDat, class) %in% c('character', 'factor')
numVect <- sapply(adultDat, class) %in% c('integer')
#Distinct categorical data in each data frame
adultLong <- melt(as.matrix(adultDat[,charVect])) %>% group_by(Var2,value) %>% summarise(n.adult=n_distinct(value))
testLong <- melt(as.matrix(adultTest[,charVect])) %>% group_by(Var2,value) %>% summarise(n.test=n_distinct(value))
#Values that will need to be investigated
merge(adultLong,testLong, by=c("Var2","value"),all=TRUE) %>% filter(is.na((n.adult+n.test)))
## Var2 value n.adult n.test
## 1 native.country Holand-Netherlands 1 NA
## 2 class <=50K 1 NA
## 3 class >50K 1 NA
## 4 class <=50K. NA 1
## 5 class >50K. NA 1
#Holland-Netherlands is a valid country name, so will just compress the . from the class variable in the test dataset.
adultTest$class <- gsub('[.]','',adultTest$class)
#We can also see that we have some missing values in the following variables that we will have to look into imputing
adultLong[adultLong$value==' ?',1]
## Source: local data frame [3 x 1]
## Groups: Var2 [3]
##
## Var2
## <fctr>
## 1 workclass
## 2 occupation
## 3 native.country
#Join our cleaned datasets
incomeDat <- rbind(adultDat,adultTest)
#Checking that the class of variables is preserved
summary(incomeDat)
## age workclass fnlwgt
## Min. :17.00 Private :33906 Min. : 12285
## 1st Qu.:28.00 Self-emp-not-inc: 3862 1st Qu.: 117550
## Median :37.00 Local-gov : 3136 Median : 178144
## Mean :38.64 ? : 2799 Mean : 189664
## 3rd Qu.:48.00 State-gov : 1981 3rd Qu.: 237642
## Max. :90.00 Self-emp-inc : 1695 Max. :1490400
## (Other) : 1463
## education education.num marital.status
## HS-grad :15784 Min. : 1.00 Divorced : 6633
## Some-college:10878 1st Qu.: 9.00 Married-AF-spouse : 37
## Bachelors : 8025 Median :10.00 Married-civ-spouse :22379
## Masters : 2657 Mean :10.08 Married-spouse-absent: 628
## Assoc-voc : 2061 3rd Qu.:12.00 Never-married :16117
## 11th : 1812 Max. :16.00 Separated : 1530
## (Other) : 7625 Widowed : 1518
## occupation relationship
## Prof-specialty : 6172 Husband :19716
## Craft-repair : 6112 Not-in-family :12583
## Exec-managerial: 6086 Other-relative: 1506
## Adm-clerical : 5611 Own-child : 7581
## Sales : 5504 Unmarried : 5125
## Other-service : 4923 Wife : 2331
## (Other) :14434
## race sex capital.gain
## Amer-Indian-Eskimo: 470 Female:16192 Min. : 0
## Asian-Pac-Islander: 1519 Male :32650 1st Qu.: 0
## Black : 4685 Median : 0
## Other : 406 Mean : 1079
## White :41762 3rd Qu.: 0
## Max. :99999
##
## capital.loss hours.per.week native.country class
## Min. : 0.0 Min. : 1.00 United-States:43832 <=50K:37155
## 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 951 >50K :11687
## Median : 0.0 Median :40.00 ? : 857
## Mean : 87.5 Mean :40.42 Philippines : 295
## 3rd Qu.: 0.0 3rd Qu.:45.00 Germany : 206
## Max. :4356.0 Max. :99.00 Puerto-Rico : 184
## (Other) : 2517
#Class looks to fixed now
missWrkClss <- incomeDat[incomeDat$workclass==' ?',]
missOcupatn <- incomeDat[incomeDat$occupation==' ?',]
missNatCntr <- incomeDat[incomeDat$native.country==' ?',]
#For the time being, going to replace vars with ? with NA
incomeDat[,charVect] <- apply(incomeDat[,charVect],2,function(x) gsub(' [?]',NA,x))
#Remove the leading spaces from the character variables
incomeDat[,charVect] <- apply(incomeDat[,charVect],2,function(x) gsub('^ ','',x))
#Converting back to factor
incomeDat <- as.data.frame(unclass(incomeDat))
#Remove all records with NA
nIncomeDat <- na.omit(incomeDat)
#Lets see how much data we lose by removing NAs
(nrow(incomeDat)-nrow(nIncomeDat))/nrow(incomeDat)*100
## [1] 7.411654
#We lose about 7.5% of the data, however it seems more hassle than its worth to impute values for this amount of data, so will continue with these. Especially since we cannot guarantee whether the missings are random or if there is an underlying pattern.
#Dataset is too big for testing, taking a random sample of 2k
sampIncome <- data.frame(nIncomeDat[sample(1:nrow(nIncomeDat), 2000,
replace=FALSE),])
#Categorical data
#Contingency table
nIncomeDat[,charVect] %>% melt(id="class") %>% group_by(variable,value,class) %>% summarise(n=n()) %>% dcast(variable+value~class,value.var="n") %>% kable()
## Warning: attributes are not identical across measure variables; they will
## be dropped
| variable | value | <=50K | >50K |
|---|---|---|---|
| workclass | Federal-gov | 857 | 549 |
| workclass | Local-gov | 2185 | 915 |
| workclass | Private | 26056 | 7251 |
| workclass | Self-emp-inc | 734 | 912 |
| workclass | Self-emp-not-inc | 2737 | 1059 |
| workclass | State-gov | 1426 | 520 |
| workclass | Without-pay | 19 | 2 |
| education | 10th | 1141 | 82 |
| education | 11th | 1530 | 89 |
| education | 12th | 534 | 43 |
| education | 1st-4th | 214 | 8 |
| education | 5th-6th | 427 | 22 |
| education | 7th-8th | 768 | 55 |
| education | 9th | 638 | 38 |
| education | Assoc-acdm | 1109 | 398 |
| education | Assoc-voc | 1455 | 504 |
| education | Bachelors | 4392 | 3178 |
| education | Doctorate | 145 | 399 |
| education | HS-grad | 12367 | 2416 |
| education | Masters | 1121 | 1393 |
| education | Preschool | 71 | 1 |
| education | Prof-school | 193 | 592 |
| education | Some-college | 7909 | 1990 |
| marital.status | Divorced | 5642 | 655 |
| marital.status | Married-AF-spouse | 18 | 14 |
| marital.status | Married-civ-spouse | 11491 | 9564 |
| marital.status | Married-spouse-absent | 498 | 54 |
| marital.status | Never-married | 13897 | 701 |
| marital.status | Separated | 1312 | 99 |
| marital.status | Widowed | 1156 | 121 |
| occupation | Adm-clerical | 4784 | 756 |
| occupation | Armed-Forces | 10 | 4 |
| occupation | Craft-repair | 4665 | 1355 |
| occupation | Exec-managerial | 3117 | 2867 |
| occupation | Farming-fishing | 1308 | 172 |
| occupation | Handlers-cleaners | 1911 | 135 |
| occupation | Machine-op-inspct | 2605 | 365 |
| occupation | Other-service | 4612 | 196 |
| occupation | Priv-house-serv | 229 | 3 |
| occupation | Prof-specialty | 3304 | 2704 |
| occupation | Protective-serv | 669 | 307 |
| occupation | Sales | 3953 | 1455 |
| occupation | Tech-support | 1009 | 411 |
| occupation | Transport-moving | 1838 | 478 |
| relationship | Husband | 10159 | 8507 |
| relationship | Not-in-family | 10474 | 1228 |
| relationship | Other-relative | 1299 | 50 |
| relationship | Own-child | 6521 | 105 |
| relationship | Unmarried | 4486 | 302 |
| relationship | Wife | 1075 | 1016 |
| race | Amer-Indian-Eskimo | 382 | 53 |
| race | Asian-Pac-Islander | 934 | 369 |
| race | Black | 3694 | 534 |
| race | Other | 308 | 45 |
| race | White | 28696 | 10207 |
| sex | Female | 13026 | 1669 |
| sex | Male | 20988 | 9539 |
| native.country | Cambodia | 17 | 9 |
| native.country | Canada | 103 | 60 |
| native.country | China | 77 | 36 |
| native.country | Columbia | 78 | 4 |
| native.country | Cuba | 99 | 34 |
| native.country | Dominican-Republic | 92 | 5 |
| native.country | Ecuador | 37 | 6 |
| native.country | El-Salvador | 136 | 11 |
| native.country | England | 72 | 47 |
| native.country | France | 20 | 16 |
| native.country | Germany | 135 | 58 |
| native.country | Greece | 31 | 18 |
| native.country | Guatemala | 83 | 3 |
| native.country | Haiti | 60 | 9 |
| native.country | Holand-Netherlands | 1 | NA |
| native.country | Honduras | 17 | 2 |
| native.country | Hong | 20 | 8 |
| native.country | Hungary | 12 | 6 |
| native.country | India | 85 | 62 |
| native.country | Iran | 34 | 22 |
| native.country | Ireland | 26 | 10 |
| native.country | Italy | 67 | 33 |
| native.country | Jamaica | 89 | 14 |
| native.country | Japan | 58 | 31 |
| native.country | Laos | 19 | 2 |
| native.country | Mexico | 856 | 47 |
| native.country | Nicaragua | 45 | 3 |
| native.country | Outlying-US(Guam-USVI-etc) | 21 | 1 |
| native.country | Peru | 41 | 4 |
| native.country | Philippines | 199 | 84 |
| native.country | Poland | 65 | 16 |
| native.country | Portugal | 50 | 12 |
| native.country | Puerto-Rico | 155 | 20 |
| native.country | Scotland | 18 | 2 |
| native.country | South | 83 | 18 |
| native.country | Taiwan | 30 | 25 |
| native.country | Thailand | 24 | 5 |
| native.country | Trinadad&Tobago | 24 | 2 |
| native.country | United-States | 30844 | 10448 |
| native.country | Vietnam | 76 | 7 |
| native.country | Yugoslavia | 15 | 8 |
#We have a lot of granularity in the categorical data which we could do without
#Looking at the education level and education num, it looks like we dont need both education and education.num. Furthermore its prob better to collapse the levels also and turn it into dummy variables if possible
table(sampIncome$education,sampIncome$education.num)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 10th 0 0 0 0 0 45 0 0 0 0 0 0 0 0 0
## 11th 0 0 0 0 0 0 78 0 0 0 0 0 0 0 0
## 12th 0 0 0 0 0 0 0 23 0 0 0 0 0 0 0
## 1st-4th 0 11 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5th-6th 0 0 22 0 0 0 0 0 0 0 0 0 0 0 0
## 7th-8th 0 0 0 35 0 0 0 0 0 0 0 0 0 0 0
## 9th 0 0 0 0 36 0 0 0 0 0 0 0 0 0 0
## Assoc-acdm 0 0 0 0 0 0 0 0 0 0 0 67 0 0 0
## Assoc-voc 0 0 0 0 0 0 0 0 0 0 98 0 0 0 0
## Bachelors 0 0 0 0 0 0 0 0 0 0 0 0 325 0 0
## Doctorate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## HS-grad 0 0 0 0 0 0 0 0 642 0 0 0 0 0 0
## Masters 0 0 0 0 0 0 0 0 0 0 0 0 0 102 0
## Preschool 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Prof-school 0 0 0 0 0 0 0 0 0 0 0 0 0 0 36
## Some-college 0 0 0 0 0 0 0 0 0 456 0 0 0 0 0
##
## 16
## 10th 0
## 11th 0
## 12th 0
## 1st-4th 0
## 5th-6th 0
## 7th-8th 0
## 9th 0
## Assoc-acdm 0
## Assoc-voc 0
## Bachelors 0
## Doctorate 22
## HS-grad 0
## Masters 0
## Preschool 0
## Prof-school 0
## Some-college 0
table(sampIncome$marital.status,sampIncome$relationship)
##
## Husband Not-in-family Other-relative Own-child
## Divorced 0 145 7 12
## Married-AF-spouse 0 0 0 0
## Married-civ-spouse 816 1 5 8
## Married-spouse-absent 0 13 1 1
## Never-married 0 317 43 258
## Separated 0 25 4 5
## Widowed 0 21 3 0
##
## Unmarried Wife
## Divorced 117 0
## Married-AF-spouse 0 0
## Married-civ-spouse 0 93
## Married-spouse-absent 4 0
## Never-married 44 0
## Separated 34 0
## Widowed 23 0
#Bar char colored by outcome
sampIncome[,charVect] %>% melt(id.vars="class") %>%
ggplot(aes(x=value,fill=class)) + geom_bar() + facet_wrap(~variable, scales = "free")
## Warning: attributes are not identical across measure variables; they will
## be dropped
#Performing Chi-sq test on the whole dataset to see which variables are independent of income class
lapply(nIncomeDat[,charVect], function(x) {
chisq.test(x, nIncomeDat$class)
})
## Warning in chisq.test(x, nIncomeDat$class): Chi-squared approximation may
## be incorrect
## Warning in chisq.test(x, nIncomeDat$class): Chi-squared approximation may
## be incorrect
## $workclass
##
## Pearson's Chi-squared test
##
## data: x and nIncomeDat$class
## X-squared = 1207.3, df = 6, p-value < 2.2e-16
##
##
## $education
##
## Pearson's Chi-squared test
##
## data: x and nIncomeDat$class
## X-squared = 5996, df = 15, p-value < 2.2e-16
##
##
## $marital.status
##
## Pearson's Chi-squared test
##
## data: x and nIncomeDat$class
## X-squared = 9109.2, df = 6, p-value < 2.2e-16
##
##
## $occupation
##
## Pearson's Chi-squared test
##
## data: x and nIncomeDat$class
## X-squared = 5415.1, df = 13, p-value < 2.2e-16
##
##
## $relationship
##
## Pearson's Chi-squared test
##
## data: x and nIncomeDat$class
## X-squared = 9357, df = 5, p-value < 2.2e-16
##
##
## $race
##
## Pearson's Chi-squared test
##
## data: x and nIncomeDat$class
## X-squared = 452.3, df = 4, p-value < 2.2e-16
##
##
## $sex
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: x and nIncomeDat$class
## X-squared = 2104.1, df = 1, p-value < 2.2e-16
##
##
## $native.country
##
## Pearson's Chi-squared test
##
## data: x and nIncomeDat$class
## X-squared = 456.01, df = 40, p-value < 2.2e-16
##
##
## $class
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: x and nIncomeDat$class
## X-squared = 45217, df = 1, p-value < 2.2e-16
#It seems that all variables are related to the outcome variable with a very strong significance
#Continuous variables
#Creates color pallette based on unique column values
#taken from ISLR
Cols=function(vec){
cols=rainbow(length(unique(vec)))
return(cols[as.numeric(as.factor(vec))])
}
#Scatterplot matrix, colored by income class
pairs(~.,data=sampIncome[,numVect],
col=Cols(sampIncome$class))
#Here age seems to be correlated with weight (fnlwgt) as is expected
#Histogram plot
as.matrix(sampIncome[,numVect]) %>% melt() %>% mutate(value=as.numeric(value)) %>% ggplot(aes(x=value)) + geom_histogram() + facet_wrap(~Var2, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
lapply(nIncomeDat[,numVect], function(x) t.test(x ~ nIncomeDat$class, var.equal = TRUE))
## $age
##
## Two Sample t-test
##
## data: x by nIncomeDat$class
## t = -51.885, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.530767 -6.982514
## sample estimates:
## mean in group <=50K mean in group >50K
## 36.74943 44.00607
##
##
## $fnlwgt
##
## Two Sample t-test
##
## data: x by nIncomeDat$class
## t = 1.5447, df = 45220, p-value = 0.1224
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -477.8235 4032.3173
## sample estimates:
## mean in group <=50K mean in group >50K
## 190175.2 188398.0
##
##
## $education.num
##
## Two Sample t-test
##
## data: x by nIncomeDat$class
## t = -75.048, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.019122 -1.916340
## sample estimates:
## mean in group <=50K mean in group >50K
## 9.63077 11.59850
##
##
## $capital.gain
##
## Two Sample t-test
##
## data: x by nIncomeDat$class
## t = -48.195, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3999.048 -3686.488
## sample estimates:
## mean in group <=50K mean in group >50K
## 149.0234 3991.7917
##
##
## $capital.loss
##
## Two Sample t-test
##
## data: x by nIncomeDat$class
## t = -31.974, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -148.0033 -130.9059
## sample estimates:
## mean in group <=50K mean in group >50K
## 54.03243 193.48706
##
##
## $hours.per.week
##
## Two Sample t-test
##
## data: x by nIncomeDat$class
## t = -49.611, df = 45220, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.568092 -6.068839
## sample estimates:
## mean in group <=50K mean in group >50K
## 39.37202 45.69049
#fnlWgt is the only variable that has no relationship with the outcome variable. We may consider dropping this
nvClass <- c(numVect[1:14],TRUE)
sampIncome[,nvClass] %>% melt(id.vars="class") %>% ggplot(aes(x=variable, y=value,group=class,fill=class)) + geom_boxplot() + facet_wrap(~variable, scales = "free")
#drop fnlwgt
#log transform age
#drop education.num
#pre-college (1:10),
plot(nIncomeDat$education.num,col=Cols(nIncomeDat$class))
#Although education matters, it doesnt seem to be a large distinction prior to high school
cIncomeDat <- nIncomeDat
cIncomeDat$education <- as.character(cIncomeDat$education)
cIncomeDat$education[cIncomeDat$education.num <=10] <- "Pre-College"
cIncomeDat$education<- as.factor(cIncomeDat$education)
#Drop education num
cIncomeDat$education.num <- NULL
#Drop final weight
cIncomeDat$fnlwgt <- NULL
cIncomeDat$age <- log(cIncomeDat$age)
#Recoding countries to regions
cIncomeDat$region <- countrycode(cIncomeDat$native.country,"country.name","region")
## Warning in countrycode(cIncomeDat$native.country, "country.name", "region"): Some values were not matched unambiguously: Columbia, England, Hong, Scotland, South, Yugoslavia
#Some issues converting columbia, england, hong, scotland, south and yugoslavia
#need to fix these and try again
cIncomeDat$native.country <- as.character(cIncomeDat$native.country)
cIncomeDat$native.country[cIncomeDat$native.country=="Columbia"] <- "Colombia"
cIncomeDat$native.country[cIncomeDat$native.country=="Hong"] <- "Hong Kong"
cIncomeDat$native.country[cIncomeDat$native.country %in% c("England","Scotland")] <- "UK"
#For the sake of argument, just picking one
cIncomeDat$native.country[cIncomeDat$native.country %in% c("Yugoslavia")] <- "Bosnia"
cIncomeDat <- cIncomeDat[cIncomeDat$native.country!="South",]
cIncomeDat$region <-( countrycode(cIncomeDat$native.country,"country.name","region"))
cIncomeDat$region <- as.factor(cIncomeDat$region)
cIncomeDat$native.country <- NULL
#Creating our data matrix that will be used for the rest of the analysis
nrow(cIncomeDat)
## [1] 45121
dm <- model.matrix(~.+0, data=cIncomeDat)
#We have one variable (workclass never worked) with no values, we will remove this
dm <- dm[,-4]
summary(dm)
## age workclassFederal-gov workclassLocal-gov workclassPrivate
## Min. :2.833 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:3.332 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :3.611 Median :0.00000 Median :0.0000 Median :1.0000
## Mean :3.592 Mean :0.03112 Mean :0.0687 Mean :0.7369
## 3rd Qu.:3.850 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :4.500 Max. :1.00000 Max. :1.0000 Max. :1.0000
## workclassSelf-emp-inc workclassSelf-emp-not-inc workclassState-gov
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.03621 Mean :0.08351 Mean :0.04311
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## workclassWithout-pay educationAssoc-voc educationBachelors
## Min. :0.0000000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000000 Median :0.00000 Median :0.0000
## Mean :0.0004654 Mean :0.04335 Mean :0.1671
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.0000000 Max. :1.00000 Max. :1.0000
## educationDoctorate educationMasters educationPre-College
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :1.0000
## Mean :0.01201 Mean :0.05552 Mean :0.6713
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
## educationProf-school marital.statusMarried-AF-spouse
## Min. :0.00000 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0.0000000
## Median :0.00000 Median :0.0000000
## Mean :0.01735 Mean :0.0007092
## 3rd Qu.:0.00000 3rd Qu.:0.0000000
## Max. :1.00000 Max. :1.0000000
## marital.statusMarried-civ-spouse marital.statusMarried-spouse-absent
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000
## Mean :0.4655 Mean :0.01217
## 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
## marital.statusNever-married marital.statusSeparated marital.statusWidowed
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.3229 Mean :0.03123 Mean :0.02819
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
## occupationArmed-Forces occupationCraft-repair occupationExec-managerial
## Min. :0.0000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000000 Median :0.0000 Median :0.0000
## Mean :0.0003103 Mean :0.1332 Mean :0.1322
## 3rd Qu.:0.0000000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000000 Max. :1.0000 Max. :1.0000
## occupationFarming-fishing occupationHandlers-cleaners
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000
## Mean :0.0328 Mean :0.04528
## 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
## occupationMachine-op-inspct occupationOther-service
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.06573 Mean :0.1062
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## occupationPriv-house-serv occupationProf-specialty
## Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.0000
## Mean :0.005142 Mean :0.1328
## 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.0000
## occupationProtective-serv occupationSales occupationTech-support
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.02163 Mean :0.1192 Mean :0.03147
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000
## occupationTransport-moving relationshipNot-in-family
## Min. :0.00000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.00000 Median :0.000
## Mean :0.05131 Mean :0.259
## 3rd Qu.:0.00000 3rd Qu.:1.000
## Max. :1.00000 Max. :1.000
## relationshipOther-relative relationshipOwn-child relationshipUnmarried
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.02968 Mean :0.1465 Mean :0.1058
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
## relationshipWife raceAsian-Pac-Islander raceBlack
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.04616 Mean :0.02671 Mean :0.0937
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
## raceOther raceWhite sexMale capital.gain
## Min. :0.000000 Min. :0.0000 Min. :0.0000 Min. : 0
## 1st Qu.:0.000000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.: 0
## Median :0.000000 Median :1.0000 Median :1.0000 Median : 0
## Mean :0.007823 Mean :0.8622 Mean :0.6752 Mean : 1102
## 3rd Qu.:0.000000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 0
## Max. :1.000000 Max. :1.0000 Max. :1.0000 Max. :99999
## capital.loss hours.per.week class>50K regionCentral America
## Min. : 0.00 Min. : 1.00 Min. :0.000 Min. :0.00000
## 1st Qu.: 0.00 1st Qu.:40.00 1st Qu.:0.000 1st Qu.:0.00000
## Median : 0.00 Median :40.00 Median :0.000 Median :0.00000
## Mean : 88.54 Mean :40.93 Mean :0.248 Mean :0.02666
## 3rd Qu.: 0.00 3rd Qu.:45.00 3rd Qu.:0.000 3rd Qu.:0.00000
## Max. :4356.00 Max. :99.00 Max. :1.000 Max. :1.00000
## regionEastern Asia regionEastern Europe regionMicronesia
## Min. :0.000000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.000000 Median :0.0000000
## Mean :0.006316 Mean :0.002194 Mean :0.0004876
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.000000 Max. :1.0000000
## regionNorthern America regionNorthern Europe regionSouth America
## Min. :0.0000 Min. :0.000000 Min. :0.000000
## 1st Qu.:1.0000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :1.0000 Median :0.000000 Median :0.000000
## Mean :0.9188 Mean :0.003879 Mean :0.003768
## 3rd Qu.:1.0000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000000 Max. :1.000000
## regionSouth-Eastern Asia regionSouthern Asia regionSouthern Europe
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.009796 Mean :0.004499 Mean :0.005186
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000
## regionWestern Europe
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.005097
## 3rd Qu.:0.000000
## Max. :1.000000
pr <- prcomp(dm,scale=TRUE)
par(mfrow=c(1,1))
biplot(pr,cex=.5)
#From the biplot, it looks like Sex, and Region contribute heavily to the second principal component
plot(pr)
#We can see that the first two principal components to a good job of explaining a lot of the variance. However, we see that the data may not be so easily separable after that
plot(pr$x[,1:2], col=Cols(cIncomeDat$class))
#The class seems to be rather well defined by the second principal component
plot(pr$x[,1:2], col=Cols(cIncomeDat$sex))
#Gender is rather well defined by the second principal component but not nearly as well separated as class
#Top 5 contributers to PC1 and PC2 respectively
sort(abs(pr$rotation[,1]),decreasing=T)[1:5]
## marital.statusMarried-civ-spouse marital.statusNever-married
## 0.3779606 0.3356694
## class>50K age
## 0.3333022 0.2952995
## relationshipOwn-child
## 0.2600073
sort(abs(pr$rotation[,2]),decreasing=T)[1:5]
## raceAsian-Pac-Islander raceWhite regionNorthern America
## 0.4339857 0.4026078 0.3656846
## regionSouth-Eastern Asia regionEastern Asia
## 0.3109853 0.2436549
set.seed(5124)
#holding out 20% for the test set
gTrain <- sample(c(FALSE,TRUE,TRUE,TRUE,TRUE),nrow(cIncomeDat),replace=TRUE)
train <- cIncomeDat[gTrain,]
test <- cIncomeDat[!gTrain,]